A comparison of methods for training population optimization in genomic selection

Key message Maximizing CDmean and Avg_GRM_self were the best criteria for training set optimization. A training set size of 50–55% (targeted) or 65–85% (untargeted) of the candidate set is needed to obtain 95% of the maximum accuracy.

Abstract With the advent of genomic selection (GS) as a widespread breeding tool, mechanisms to efficiently design an optimal training set for GS models have become more relevant, since they allow maximizing the accuracy while minimizing the phenotyping costs. The literature describes many training set optimization methods, but a comprehensive comparison among them is lacking. This work aimed to provide an extensive benchmark of optimization methods and optimal training set sizes by testing a wide range of them in seven datasets covering six different species, different genetic architectures, population structures, and heritabilities, and with several GS models, in order to provide guidelines for their application in breeding programs. Our results showed that targeted optimization (which uses information from the test set) performed better than untargeted optimization (which does not use test set data), especially when heritability was low. The mean coefficient of determination was the best targeted method, although it was computationally intensive. Minimizing the average relationship within the training set was the best strategy for untargeted optimization. Regarding the optimal training set size, maximum accuracy was obtained when the training set was the entire candidate set. Nevertheless, 50–55% of the candidate set was enough to reach 95–100% of the maximum accuracy in the targeted scenario, while 65–85% was needed for untargeted optimization. Our results also suggested that a diverse training set makes GS robust against population structure, while including clustering information was less effective. The choice of the GS model did not have a significant influence on the prediction accuracies.
Supplementary Information The online version contains supplementary material available at 10.1007/s00122-023-04265-6.

Table S1 Summary of the training set optimization methods used. Note that TrainSel always maximizes the selected evaluation criterion; the parameters used were 200 iterations of the genetic algorithm, a population size of 200, 5 elite solutions selected, and 10 steps of simulated annealing per iteration of the genetic algorithm. n_{set}; number of instances present in the set indicated in the subindex. A; relationship matrix. λ; shrinkage parameter. X; marker matrix. C; contrast matrix. a, b; weighting parameters with any value ≥ 0. Tr[·]; trace of a matrix. sumsqr[·]; sum of the squared elements of a matrix. diag(·); main diagonal of a matrix. mean(·); average of all elements of a vector or matrix. TRS; training set. TP; target population. I; identity matrix. If I has a subindex, it indicates its dimensions. Otherwise, it has the dimensions needed for the operations. For all other matrices, a subindex indicates that a subset is taken. For instance, X_{TRS;All} is the marker matrix whose rows are the individuals in the training set, with all columns taken.

Method: StratSamp
Mechanism: Random sampling in which the number of individuals selected within each cluster for the training set is forced to be proportional to the total size of that cluster in the candidate set.
Reference: Isidro et al. (2015)

Method: PAM
Mechanism:
1. Select k initial random medoids (k = training set size).
2. Build a cluster around each medoid. All lines in the candidate set are assigned to the cluster with the closest medoid.
3. The total cost of the selected set of medoids (T) is calculated as the sum of the dissimilarities of all the elements to their closest medoid.
4. Select a random non-medoid individual to become a new medoid, replacing one of the old ones.
5. T is recalculated and, if it is smaller than it previously was, the new set of medoids is kept.
6. Steps 3-5 are repeated until convergence is reached.

Method: CDMEAN2
Mechanism: $CDMEAN2 = -\mathrm{mean}\left[\mathrm{diag}\left(C X_{TP;All} (X'_{TRS;All} X_{TRS;All} + \lambda I)^{-1} X'_{TRS;All} X_{TRS;All} (X'_{TRS;All} X_{TRS;All} + \lambda I)^{-1} X'_{TP;All} C'\right) / \mathrm{diag}\left(C X_{TP;All} X'_{TP;All} C'\right)\right]$
Reference: Akdemir (2017)

Method: Rscore
Mechanism: $Rscore = q_{12} / \sqrt{q_1 q_2}$, where
$q_{12} = \mathrm{Tr}[X^{T}_{TP;All}\, IJ\, X_{TP;All}\, Amat\, X_{TRS;All}]$;
$q_1 = (n_{TP} - 1) + \mathrm{sumsqr}[X^{T}_{TP;All}\, IJ\, X_{TP;All}]$;
$q_2 = \mathrm{sumsqr}[Amat^{T} X^{T}_{TP;All}\, IJ\, X_{TP;All}\, Amat] + \mathrm{sumsqr}[X_{TP;All}\, Amat^{T}\, IJ\, Amat\, X_{TRS;All}]$;
$Amat = X^{T}_{TRS;All} (X_{TRS;All} X^{T}_{TRS;All} + I(1/n_{markers}))^{-1}$;
$IJ = I_{n_{TP}} - I_{n_{TP}}(1/n_{TP})$
Reference: Ou and Liao (2019)

Method: gAvg GRM
Mechanism: $gAvg\ GRM = a \cdot \mathrm{mean}(A_{TRS;TP}) - b \cdot \mathrm{mean}(A_{TRS;TRS})$
Reference: Derived from Atanda et al. (2021)

Fig. S1 Overview of the different CDmean criteria displayed on a toy example with 2 clusters (C1 and C2) on the first two principal components (PC1 and PC2). We aim to use TrainSel to select an optimal training set in an untargeted scenario (the target population is the candidate set). The desired TRS consists of 7 individuals from a candidate set of 14 separated into two clusters of size 6 and 8. Both WIClustCDmean and OvClustCDmean sampled a proportional number of individuals from each cluster.
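
As an illustration of how two of the criteria in Table S1 can be evaluated for one candidate training set, the following is a minimal NumPy sketch rather than the implementation used in this work. The function names, the toy marker matrix, the relationship matrix computed as XX'/(number of markers), and the default of an identity contrast matrix C are all assumptions made for the example; the sign of CDMEAN2 follows the expression as printed in Table S1.

```python
import numpy as np

def g_avg_grm(A, trs, tp, a=1.0, b=1.0):
    """gAvg GRM = a * mean(A[TRS, TP]) - b * mean(A[TRS, TRS])."""
    trs, tp = np.asarray(trs), np.asarray(tp)
    return a * A[np.ix_(trs, tp)].mean() - b * A[np.ix_(trs, trs)].mean()

def cdmean2(X, trs, tp, lam=1.0, C=None):
    """CDMEAN2 as written in Table S1 (marker-based, ridge-type shrinkage)."""
    X_trs, X_tp = X[np.asarray(trs)], X[np.asarray(tp)]
    if C is None:
        # Assumption: identity contrast matrix (one contrast per TP individual).
        C = np.eye(X_tp.shape[0])
    G = X_trs.T @ X_trs                                 # X'_TRS X_TRS (markers x markers)
    # P = (X'_TRS X_TRS + lam * I)^-1 X'_TP, obtained without an explicit inverse.
    P = np.linalg.solve(G + lam * np.eye(G.shape[0]), X_tp.T)
    num = np.diag(C @ P.T @ G @ P @ C.T)
    den = np.diag(C @ X_tp @ X_tp.T @ C.T)
    return -np.mean(num / den)

# Toy data (hypothetical): 14 candidates, 40 markers coded -1/0/1.
rng = np.random.default_rng(1)
X = rng.choice([-1.0, 0.0, 1.0], size=(14, 40))
A = X @ X.T / X.shape[1]          # simple relationship matrix, an assumption
candidates = np.arange(14)
trs = candidates[:7]              # one arbitrary training set of size 7

# Untargeted scenario: the target population is the whole candidate set.
print("gAvg GRM (a=0, b=1):", g_avg_grm(A, trs, candidates, a=0.0, b=1.0))
print("CDMEAN2:", cdmean2(X, trs, candidates))
```

With a = 0 and b = 1 the first criterion rewards a low average relationship within the training set; an optimizer such as TrainSel would evaluate such a function over many candidate sets.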

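The two sampling-based methods in Table S1 (StratSamp and PAM) can be sketched as follows. This is a hedged illustration, not the code used in this work: the function names are hypothetical, rounding of the per-cluster quotas uses a largest-remainder rule (an assumption, since the table does not specify how fractional quotas are handled), and "convergence" in the PAM loop is approximated by stopping after a fixed number of consecutive rejected swaps.

```python
import numpy as np

def strat_samp(clusters, size, seed=0):
    """Random sampling with per-cluster counts proportional to cluster size."""
    rng = np.random.default_rng(seed)
    clusters = np.asarray(clusters)
    labels, counts = np.unique(clusters, return_counts=True)
    quota = size * counts / counts.sum()
    take = np.floor(quota).astype(int)
    # Assumption: leftover slots go to the clusters with the largest remainders.
    short = size - take.sum()
    take[np.argsort(-(quota - take))[:short]] += 1
    chosen = [rng.choice(np.flatnonzero(clusters == lab), size=t, replace=False)
              for lab, t in zip(labels, take)]
    return np.sort(np.concatenate(chosen))

def total_cost(D, medoids):
    """Step 3: sum of dissimilarities of every element to its closest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam_random_swap(D, k, max_iter=5000, patience=300, seed=0):
    """Steps 1-6 of the PAM description, with a random single-medoid swap."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)       # step 1
    cost = total_cost(D, medoids)                        # steps 2-3
    stalled = 0
    for _ in range(max_iter):
        non_medoids = np.setdiff1d(np.arange(n), medoids)
        candidate = medoids.copy()
        candidate[rng.integers(k)] = rng.choice(non_medoids)   # step 4
        new_cost = total_cost(D, candidate)                    # step 5
        if new_cost < cost:
            medoids, cost, stalled = candidate, new_cost, 0
        else:
            stalled += 1
            if stalled >= patience:   # assumed convergence rule
                break
    return np.sort(medoids), cost

# Toy example mirroring Fig. S1: 14 candidates in clusters of size 6 and 8.
clusters = np.array([0] * 6 + [1] * 8)
picked = strat_samp(clusters, size=7)
print("StratSamp picks per cluster:", np.bincount(clusters[picked]))

rng = np.random.default_rng(2)
points = rng.normal(size=(14, 2))
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
medoids, cost = pam_random_swap(D, k=7)
print("PAM training set:", medoids, "total cost:", round(float(cost), 3))
```

In this sketch the medoids themselves form the training set, following step 1 of the description (k = training set size).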
Table S2
Average accuracy across training set optimization methods and training set sizes for all possible combinations of datasets, traits and models. The total average is the mean for all values for each model. HT; plant height, FT; flowering time, YLD; yield, FP; florets per panicle, PC; protein content, MO; moisture content, R8; the number of reproductive nodes and pods in the main stem during the reproductive stages R7-R8, DBH; diameter at breast height, DE; density, ST; standability, AN; anthesis date, Sim; simulated trait with heritability 0.5.

Table S3
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for maize dataset and FT trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.
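
For reference, the mean and SEM reported in these tables can be computed as sketched below. This is an illustrative sketch only: the accuracies are simulated placeholders, and the use of the sample standard deviation (ddof = 1) is an assumption, since the caption does not state which estimator was used.

```python
import numpy as np

def mean_and_sem(accuracies):
    """Mean and standard error of the mean over repeated runs."""
    acc = np.asarray(accuracies, dtype=float)
    # SEM = sample standard deviation / sqrt(number of iterations)
    return acc.mean(), acc.std(ddof=1) / np.sqrt(acc.size)

# Hypothetical accuracies for one method, model and TRS size over 40 iterations.
rng = np.random.default_rng(0)
acc = rng.normal(loc=0.6, scale=0.05, size=40)
m, s = mean_and_sem(acc)
print(f"accuracy = {m:.3f} (SEM {s:.3f})")
```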

Table S4
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for maize dataset and HT trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S5
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for maize dataset and YLD trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S6
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for maize dataset and the simulated trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S7
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for rice dataset and FT trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S8
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for rice dataset and HT trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S9
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for rice dataset and YLD trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.
Table S10 Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for rice dataset and the simulated trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.
Table S11 Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for ricePopStr dataset and FP trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.
Table S12 Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for ricePopStr dataset and FT trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.
Table S13 Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for ricePopStr dataset and HT trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.
Table S14 Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for ricePopStr dataset and PC trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.
Table S15 Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set) for ricePopStr dataset and the simulated trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.
Table S16 Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for sorghum dataset and HT trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.
Table S17 Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for sorghum dataset and MO trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.
Table S18 Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for sorghum dataset and YLD trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.
Table S19 Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for sorghum dataset and the simulated trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S20
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for soybean dataset and HT trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S21
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for soybean dataset and R8 trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S22
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for soybean dataset and YLD trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S23
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for soybean dataset and the simulated trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S24
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for spruce dataset and DBH trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S25
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for spruce dataset and DE trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S26
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for spruce dataset and HT trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S27
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for spruce dataset and the simulated trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S28
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for switchgrass dataset and AN trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S29
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for switchgrass dataset and HT trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S30
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for switchgrass dataset and ST trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S31
Average accuracy and its corresponding standard error of the mean (SEM) across the 40 iterations for all training set optimization methods, models and training set (TRS) sizes (expressed as percentage of the candidate set (CS)) for switchgrass dataset and the simulated trait. If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed.

Table S32
Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the GBLUP model in the maize dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.
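
The AUC gain described in this caption can be illustrated with the sketch below. It is not the agricolae-based computation used for the tables: the trapezoidal rule, the 10% spacing of training set sizes, the definition of gain as 100 × (AUC_method − AUC_random) / AUC_random per repetition, and the simulated accuracies are all assumptions made for the example.

```python
import numpy as np

def auc_trapezoid(sizes, accuracy):
    """Area under the accuracy-vs-training-set-size curve (trapezoidal rule)."""
    x = np.asarray(sizes, dtype=float)
    y = np.asarray(accuracy, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))

def auc_gain_percent(sizes, acc_method, acc_random):
    """Mean and SEM of the per-repetition AUC gain (%) over random sampling."""
    auc_m = np.array([auc_trapezoid(sizes, rep) for rep in acc_method])
    auc_r = np.array([auc_trapezoid(sizes, rep) for rep in acc_random])
    gain = 100.0 * (auc_m - auc_r) / auc_r
    return gain.mean(), gain.std(ddof=1) / np.sqrt(gain.size)

# Hypothetical curves: 40 repetitions, TRS sizes from 10 to 80% of the candidate set.
sizes = np.arange(10, 90, 10)
rng = np.random.default_rng(3)
base = 0.30 + 0.004 * sizes                       # made-up accuracy trend
acc_random = base + rng.normal(0, 0.02, size=(40, sizes.size))
acc_method = base + 0.03 + rng.normal(0, 0.02, size=(40, sizes.size))

mean_gain, sem_gain = auc_gain_percent(sizes, acc_method, acc_random)
print(f"AUC gain over random sampling: {mean_gain:.2f}% (SEM {sem_gain:.2f})")
```

The paired per-repetition AUC values (method versus random sampling) are the kind of input a Wilcoxon signed-rank test takes; the test itself is left out of the sketch.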

Table S33
Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the BayesB model in the maize dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S34
Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the RKHS model in the maize dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S35
Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the GBLUP model in the rice dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S36
Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the BayesB model in the rice dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S37
Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the RKHS model in the rice dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S38
Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the GBLUP model in the soybean dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S39
Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the BayesB model in the soybean dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S40 Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the RKHS model in the soybean dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S41
Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the GBLUP model in the spruce dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S42
Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the BayesB model in the spruce dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S43
Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the RKHS model in the spruce dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S44
Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the GBLUP model in the sorghum dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S45
Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the BayesB model in the sorghum dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S46
Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the RKHS model in the sorghum dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S47
Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the GBLUP model in the switchgrass dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S48
Percentage of gain in Area Under the Curve (AUC) for every optimization method and trait compared to random sampling using the BayesB model in the switchgrass dataset. The area under the curve was computed using the agricolae package. AUC is the area under the curve generated when plotting the accuracy of the model against the training set size. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: Significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S49
Percentage of gain in area under the curve (AUC) for every optimization method and trait compared to random sampling using the RKHS model in the switchgrass dataset. The AUC is the area under the curve obtained by plotting the accuracy of the model against the training set size, and it was computed using the agricolae package. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S50
Percentage of gain in area under the curve (AUC) for every optimization method and trait compared to random sampling using the GBLUP model in the ricePopStr dataset. The AUC is the area under the curve obtained by plotting the accuracy of the model against the training set size, and it was computed using the agricolae package. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S51
Percentage of gain in area under the curve (AUC) for every optimization method and trait compared to random sampling using the BayesB model in the ricePopStr dataset. The AUC is the area under the curve obtained by plotting the accuracy of the model against the training set size, and it was computed using the agricolae package. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: significance. * P < 0.05, ** P < 0.01, *** P < 0.001.

Table S52
Percentage of gain in area under the curve (AUC) for every optimization method and trait compared to random sampling using the RKHS model in the ricePopStr dataset. The AUC is the area under the curve obtained by plotting the accuracy of the model against the training set size, and it was computed using the agricolae package. As no optimization was performed when using the entire candidate set, the results in the table show the AUC for training set sizes from 10 to 80% of the candidate set. We present the mean and the standard error of the mean (SEM) over the 40 repetitions. Asterisks indicate whether there is a significant difference between the AUC for the training set optimization method and random sampling in a Wilcoxon signed-rank test. Sig: significance. * P < 0.05, ** P < 0.01, *** P < 0.001.
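
The AUC gain reported in the tables above can be illustrated with a minimal Python sketch. It is not the authors' code (the captions state that the agricolae R package was used). It assumes a trapezoidal area under the accuracy-versus-training-set-size curve and a gain expressed relative to random sampling. The function names and the accuracy values are hypothetical, and whether the gain was computed per repetition before averaging is an assumption.

```python
from math import sqrt
from statistics import mean, stdev


def auc_trapezoid(sizes, accuracies):
    """Trapezoidal area under the accuracy vs. training set size curve."""
    area = 0.0
    for i in range(1, len(sizes)):
        width = sizes[i] - sizes[i - 1]
        area += width * (accuracies[i] + accuracies[i - 1]) / 2.0
    return area


def auc_gain_percent(sizes, acc_method, acc_random):
    """Percentage gain in AUC of an optimization method over random sampling."""
    auc_random = auc_trapezoid(sizes, acc_random)
    return 100.0 * (auc_trapezoid(sizes, acc_method) - auc_random) / auc_random


def mean_and_sem(values):
    """Mean and standard error of the mean over repetitions."""
    return mean(values), stdev(values) / sqrt(len(values))


# Hypothetical accuracies for one repetition at 10-80% of the candidate set.
sizes = [10, 20, 40, 60, 80]
acc_random = [0.30, 0.40, 0.50, 0.55, 0.58]
acc_method = [0.36, 0.45, 0.53, 0.57, 0.59]
gain = auc_gain_percent(sizes, acc_method, acc_random)
print(f"AUC gain over random sampling: {gain:.2f}%")
```

With one gain value per repetition, `mean_and_sem` would give the mean and SEM columns, and a paired test such as the Wilcoxon signed-rank test could then be applied to the per-repetition AUC values of the method and of random sampling.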

Fig. S2
Frequency of selection of individuals for the training set across the 40 iterations for all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The frequency of the individuals randomly sampled for the test sets is also shown in plot A. All plots belong to the maize dataset with a training set size of 10% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Red colour is used to highlight the 5% most frequently selected individuals, blue colour corresponds to the next 10% most selected individuals and grey is used for the rest.

Fig. S3
Frequency of selection of individuals for the training set across the 40 iterations for all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The frequency of the individuals randomly sampled for the test sets is also shown in plot A. All plots belong to the rice dataset with a training set size of 10% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Red colour is used to highlight the 5% most frequently selected individuals, blue colour corresponds to the next 10% most selected individuals and grey is used for the rest.

Fig. S4
Frequency of selection of individuals for the training set across the 40 iterations for all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The frequency of the individuals randomly sampled for the test sets is also shown in plot A. All plots belong to the ricePopStr dataset with a training set size of 10% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Red colour is used to highlight the 5% most frequently selected individuals, blue colour corresponds to the next 10% most selected individuals and grey is used for the rest.

Fig. S5
Frequency of selection of individuals for the training set across the 40 iterations for all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The frequency of the individuals randomly sampled for the test sets is also shown in plot A. All plots belong to the sorghum dataset with a training set size of 10% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Red colour is used to highlight the 5% most frequently selected individuals, blue colour corresponds to the next 10% most selected individuals and grey is used for the rest.

Fig. S6
Frequency of selection of individuals for the training set across the 40 iterations for all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The frequency of the individuals randomly sampled for the test sets is also shown in plot A. All plots belong to the soybean dataset with a training set size of 10% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Red colour is used to highlight the 5% most frequently selected individuals, blue colour corresponds to the next 10% most selected individuals and grey is used for the rest.

Fig. S7
Frequency of selection of individuals for the training set across the 40 iterations for all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The frequency of the individuals randomly sampled for the test sets is also shown in plot A. All plots belong to the spruce dataset with a training set size of 10% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Red colour is used to highlight the 5% most frequently selected individuals, blue colour corresponds to the next 10% most selected individuals and grey is used for the rest.

Fig. S8
Frequency of selection of individuals for the training set across the 40 iterations for all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The frequency of the individuals randomly sampled for the test sets is also shown in plot A. All plots belong to the switchgrass dataset with a training set size of 10% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Red colour is used to highlight the 5% most frequently selected individuals, blue colour corresponds to the next 10% most selected individuals and grey is used for the rest.

Fig. S9
Frequency of selection of individuals for the training set across the 40 iterations for all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The frequency of the individuals randomly sampled for the test sets is also shown in plot A. All plots belong to the maize dataset with a training set size of 40% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Red colour is used to highlight the 5% most frequently selected individuals, blue colour corresponds to the next 10% most selected individuals and grey is used for the rest.

Rice dataset, training set size = 40% of candidate set
Fig. S10 Frequency of selection of individuals for the training set across the 40 iterations for all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The frequency of the individuals randomly sampled for the test sets is also shown in plot A. All plots belong to the rice dataset with a training set size of 40% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Red colour is used to highlight the 5% most frequently selected individuals, blue colour corresponds to the next 10% most selected individuals and grey is used for the rest.

RicePopStr dataset, training set size = 40% of candidate set
Fig. S11 Frequency of selection of individuals for the training set across the 40 iterations for all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The frequency of the individuals randomly sampled for the test sets is also shown in plot A. All plots belong to the ricePopStr dataset with a training set size of 40% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Red colour is used to highlight the 5% most frequently selected individuals, blue colour corresponds to the next 10% most selected individuals and grey is used for the rest.

Fig. S12 Frequency of selection of individuals for the training set across the 40 iterations for all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The frequency of the individuals randomly sampled for the test sets is also shown in plot A. All plots belong to the sorghum dataset with a training set size of 40% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Red colour is used to highlight the 5% most frequently selected individuals, blue colour corresponds to the next 10% most selected individuals and grey is used for the rest.

Soybean dataset, training set size = 40% of candidate set
Fig. S13 Frequency of selection of individuals for the training set across the 40 iterations for all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The frequency of the individuals randomly sampled for the test sets is also shown in plot A. All plots belong to the soybean dataset with a training set size of 40% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Red colour is used to highlight the 5% most frequently selected individuals, blue colour corresponds to the next 10% most selected individuals and grey is used for the rest.

Spruce dataset, training set size = 40% of candidate set
Fig. S14 Frequency of selection of individuals for the training set across the 40 iterations for all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The frequency of the individuals randomly sampled for the test sets is also shown in plot A. All plots belong to the spruce dataset with a training set size of 40% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Red colour is used to highlight the 5% most frequently selected individuals, blue colour corresponds to the next 10% most selected individuals and grey is used for the rest.

Switchgrass dataset, training set size = 40% of candidate set
Fig. S15 Frequency of selection of individuals for the training set across the 40 iterations for all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The frequency of the individuals randomly sampled for the test sets is also shown in plot A. All plots belong to the switchgrass dataset with a training set size of 40% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Red colour is used to highlight the 5% most frequently selected individuals, blue colour corresponds to the next 10% most selected individuals and grey is used for the rest.
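
The colour classes used in the selection-frequency plots above can be sketched as follows. This is an illustrative Python example under stated assumptions, not the authors' plotting code: the function names, the tie-breaking rule (by identifier) and the rounding of the 5% and 10% class sizes are all hypothetical choices.

```python
from collections import Counter


def selection_frequency(selected_per_iteration, individuals):
    """Fraction of iterations in which each individual entered the training set."""
    counts = Counter()
    for selected in selected_per_iteration:
        counts.update(set(selected))
    n_iter = len(selected_per_iteration)
    return {ind: counts[ind] / n_iter for ind in individuals}


def colour_classes(freq, top=0.05, nxt=0.10):
    """Red for the top 5% most selected, blue for the next 10%, grey otherwise."""
    # Assumption: ties in frequency are broken by the individual identifier.
    ranked = sorted(freq, key=lambda ind: (-freq[ind], ind))
    n_top = int(round(top * len(ranked)))
    n_next = int(round(nxt * len(ranked)))
    colours = {}
    for rank, ind in enumerate(ranked):
        if rank < n_top:
            colours[ind] = "red"
        elif rank < n_top + n_next:
            colours[ind] = "blue"
        else:
            colours[ind] = "grey"
    return colours


# Hypothetical example: 20 individuals, 4 iterations of training set selection.
individuals = list(range(20))
iterations = [[0, 1, 2, 3], [0, 1, 2, 4], [0, 1, 2, 5], [0, 3, 4, 5]]
freq = selection_frequency(iterations, individuals)
colours = colour_classes(freq)
```

In the figures, each individual would then be drawn at its coordinates on the first two principal components with the colour assigned here.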

Note 2. Representative iteration
With the aim of validating the results obtained using the frequency of selection (Figures S2-S15), we employed a representative iteration of the cross-validation and plotted which individuals were selected by each optimization method for said iteration (Figures S16-S22). The representative iteration was the one whose accuracies across traits, models and optimization methods had the highest correlation with the corresponding average values for the 40 iterations. [...] note that in targeted CDmean this trend is no longer observed and the genetic space is evenly sampled, which is consistent with its good performance regardless of population structure (Table 3).

Maize dataset, training set size = 40% of candidate set
Fig. S16 Selected individuals for the training set in a single iteration by all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The individuals randomly selected as the test set are also shown in plot A. All plots belong to the maize dataset with a training set size of 40% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Blue colour indicates non-selected individuals and red colour indicates selected individuals. The iteration to be plotted was chosen as the one whose accuracies across traits, models and optimization methods achieved the highest correlation with the corresponding average values for the 40 iterations.

Rice dataset, training set size = 40% of candidate set
Fig. S17 Selected individuals for the training set in a single iteration by all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The individuals randomly selected as the test set are also shown in plot A. All plots belong to the rice dataset with a training set size of 40% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Blue colour indicates non-selected individuals and red colour indicates selected individuals. The iteration to be plotted was chosen as the one whose accuracies across traits, models and optimization methods achieved the highest correlation with the corresponding average values for the 40 iterations.

RicePopStr dataset, training set size = 40% of candidate set
Fig. S18 Selected individuals for the training set in a single iteration by all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The individuals randomly selected as the test set are also shown in plot A. All plots belong to the ricePopStr dataset with a training set size of 40% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Blue colour indicates non-selected individuals and red colour indicates selected individuals. The iteration to be plotted was chosen as the one whose accuracies across traits, models and optimization methods achieved the highest correlation with the corresponding average values for the 40 iterations.

Sorghum dataset, training set size = 40% of candidate set
Fig. S19 Selected individuals for the training set in a single iteration by all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The individuals randomly selected as the test set are also shown in plot A. All plots belong to the sorghum dataset with a training set size of 40% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Blue colour indicates non-selected individuals and red colour indicates selected individuals. The iteration to be plotted was chosen as the one whose accuracies across traits, models and optimization methods achieved the highest correlation with the corresponding average values for the 40 iterations.

Soybean dataset, training set size = 40% of candidate set
Fig. S20 Selected individuals for the training set in a single iteration by all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The individuals randomly selected as the test set are also shown in plot A. All plots belong to the soybean dataset with a training set size of 40% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Blue colour indicates non-selected individuals and red colour indicates selected individuals. The iteration to be plotted was chosen as the one whose accuracies across traits, models and optimization methods achieved the highest correlation with the corresponding average values for the 40 iterations.

Spruce dataset, training set size = 40% of candidate set
Fig. S21 Selected individuals for the training set in a single iteration by all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The individuals randomly selected as the test set are also shown in plot A. All plots belong to the spruce dataset with a training set size of 40% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Blue colour indicates non-selected individuals and red colour indicates selected individuals. The iteration to be plotted was chosen as the one whose accuracies across traits, models and optimization methods achieved the highest correlation with the corresponding average values for the 40 iterations.

Switchgrass dataset, training set size = 40% of candidate set
Fig. S22 Selected individuals for the training set in a single iteration by all training set optimization methods (plots B-P). If "targ" is added at the end of the name of a method, it corresponds to targeted optimization. Otherwise, untargeted optimization was performed. The individuals randomly selected as the test set are also shown in plot A. All plots belong to the switchgrass dataset with a training set size of 40% of the candidate set. The two axes in the plots are the first two principal components that summarize the genetic space and each point is an individual within the dataset. Blue colour indicates non-selected individuals and red colour indicates selected individuals. The iteration to be plotted was chosen as the one whose accuracies across traits, models and optimization methods achieved the highest correlation with the corresponding average values for the 40 iterations.
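
The choice of the representative iteration described in the captions above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes a Pearson correlation, and the function names and the small accuracy matrix are hypothetical. Each row holds the accuracies of one iteration across all trait, model and optimization method combinations.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def representative_iteration(acc):
    """Index of the iteration most correlated with the across-iteration averages.

    acc[i][j] is the accuracy of iteration i for combination j
    (trait x model x optimization method).
    """
    n_iter = len(acc)
    n_comb = len(acc[0])
    averages = [sum(row[j] for row in acc) / n_iter for j in range(n_comb)]
    correlations = [pearson(row, averages) for row in acc]
    best = max(range(n_iter), key=correlations.__getitem__)
    return best, correlations


# Hypothetical accuracies: 3 iterations x 4 combinations.
acc = [
    [0.2, 0.4, 0.6, 0.8],
    [0.8, 0.4, 0.6, 0.2],
    [0.3, 0.4, 0.6, 0.7],
]
best, correlations = representative_iteration(acc)
```

In this toy example the third iteration (index 2) tracks the averages most closely, so it would be the one plotted.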

Fig. S23
Correlation between the average performance of the optimization methods in the non-simulated traits and the population structure of the datasets. The performance of the methods was measured as the gain in area under the curve (AUC) relative to random sampling across training set sizes for each dataset-trait-model combination. Population structure was measured as the percentage of variance explained by the first 20 principal components of a principal component analysis performed on the marker data.
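
The population structure measure in this caption can be illustrated with a short NumPy sketch. It is a hedged example only: the caption does not state how the markers were preprocessed, so column centring without scaling is an assumption, and the function name and the simulated marker matrix are hypothetical.

```python
import numpy as np


def variance_explained(markers, n_components=20):
    """Percentage of marker variance explained by the first principal components."""
    X = np.asarray(markers, dtype=float)
    Xc = X - X.mean(axis=0)  # centre each marker (assumption: no scaling)
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    variances = singular_values ** 2  # proportional to the PC variances
    k = min(n_components, variances.size)
    return 100.0 * variances[:k].sum() / variances.sum()


# Hypothetical marker matrix: 30 individuals x 50 markers coded 0/1/2.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(30, 50))
structure = variance_explained(X, n_components=20)
```

A strongly structured population concentrates more variance in the leading components, so this percentage would be higher than for an unstructured one.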

Fig. S24
Correlation between the average performance of the optimization methods for each trait and its heritability. The performance of the methods was measured as the gain in area under the curve (AUC) relative to random sampling across training set sizes for each dataset-trait-model combination. It is important to note that, before calculating the average, the values within each dataset-trait-model combination were normalized.

Table S53
Time needed, in seconds, to run different training set optimization methods for a dataset whose size is indicated in the header. $X_{n \times p}$ indicates that the marker matrix $X$ for the dataset contains $n$ individuals and $p$ markers per line. The TrainSel parameters used were 100 iterations, 100 training sets in the population of the genetic algorithm, 5 simulated annealing steps per iteration and 5 elite lines used as parents of the next generation. All methods were tested for untargeted optimization with candidate set = 85% of the total, training set size = 50% of the candidate set and target population = entire candidate set for all methods except CDmean, where the target population is the remaining set (individuals in the candidate set not selected for the training set). The values shown in this table are the average over 20 repetitions. The time complexity shown is an approximation that assumes that the training set, target population and candidate set are a constant fraction of $n$ and that the time complexity of the product and the inversion of an $n \times n$ matrix is $O(n^3)$. It is important to note that the time complexity specified in the last column and the empirical values obtained for the computational time do not match for CDmean, Rscore and the Avg_GRM variants because the time complexity refers only to the evaluation criterion and does not take into account the genetic algorithm and simulated annealing performed by TrainSel. However, for large values of $n$ and $p$, the time taken by the evaluation criterion dominates and the time complexity shown is a good approximation of the real one. For PAM, the time complexity does not take into account the time needed to calculate the dissimilarity matrix, which explains the disparity between it and the empirical values. Stratified sampling was not included in this table because it is close to instant for any dataset size.
All other methods tested in this work are variations of the ones shown in this table, and their computational times would be very similar.

Average correlation between the accuracy and the evaluation metrics across the tested training set sizes (10, 20, 40, 60 and 80% of the candidate set) for all traits in the maize dataset. The horizontal axis contains the optimization methods used to obtain the different training sets. For each training set, the evaluation metrics in the vertical axis were computed and correlated with the accuracy obtained using the GBLUP model.

Fig. S32
Average correlation between the accuracy and the evaluation metrics across the tested training set sizes (10, 20, 40, 60 and 80% of the candidate set) for all traits in the rice dataset. The horizontal axis contains the optimization methods used to obtain the different training sets. For each training set, the evaluation metrics in the vertical axis were computed and correlated with the accuracy obtained using the GBLUP model.

Fig. S33
Average correlation between the accuracy and the evaluation metrics across the tested training set sizes (10, 20, 40, 60 and 80% of the candidate set) for all traits in the ricePopStr dataset. The horizontal axis contains the optimization methods used to obtain the different training sets. For each training set, the evaluation metrics in the vertical axis were computed and correlated with the accuracy obtained using the GBLUP model.

Fig. S34
Average correlation between the accuracy and the evaluation metrics across the tested training set sizes (10, 20, 40, 60 and 80% of the candidate set) for all traits in the sorghum dataset. The horizontal axis contains the optimization methods used to obtain the different training sets. For each training set, the evaluation metrics in the vertical axis were computed and correlated with the accuracy obtained using the GBLUP model.

Fig. S35
Average correlation between the accuracy and the evaluation metrics across the tested training set sizes (10, 20, 40, 60 and 80% of the candidate set) for all traits in the spruce dataset. The horizontal axis contains the optimization methods used to obtain the different training sets. For each training set, the evaluation metrics in the vertical axis were computed and correlated with the accuracy obtained using the GBLUP model.

Fig. S36
Average correlation between the accuracy and the evaluation metrics across the tested training set sizes (10, 20, 40, 60 and 80% of the candidate set) for all traits in the switchgrass dataset. The horizontal axis contains the optimization methods used to obtain the different training sets. For each training set, the evaluation metrics in the vertical axis were computed and correlated with the accuracy obtained using the GBLUP model.